Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

How are you feeling right now?

I am excited to have enrolled in Open Data Science.

What do you expect to learn?

I look forward to learning

  • Data analysis
  • Prediction

Where did you hear about the course?

I heard about it in 1905 through an email from DONASCI

My github link

date()
## [1] "Sun Dec 12 18:14:34 2021"

The text continues here.


Regression and model validation

Describe the work you have done this week and summarize your learning.

Introduction to the data

In this exercise, student feedback data (more information here) collected during 2014-2015 was used to assess the relationship between learning approaches and students' achievement in an introductory statistics course. The dataset underwent data wrangling and will be used to develop a linear regression model and interpret the results.

1. Load the data into your Rworkspace from your work directory

The first step is to set the working directory to the folder where we have stored our data

#set the working directory of your R session the IODS project folder
setwd("C:/LocalData/ocholla/IODS-project")

We load the dataset saved in the data wrangling exercise. We read the file with header = TRUE and stringsAsFactors = TRUE to ensure that the factor variables are not treated as plain strings.

students2014<-read.csv("Data/learning2014.csv", header = TRUE, stringsAsFactors = TRUE)

Data exploration and analysis

We explore the data structure and its dimensions

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.5 3.1 3.5 3.5 3.7 4.7 3.7 3.1 4.3 4 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166   7

The data contains 166 observations of 7 variables. The observations refer to the students, and the variables are student gender (male or female), age (derived from the date of birth), attitude, deep learning (deep), strategic learning (stra), surface learning (surf) and exam points.

Summaries of the variables in the data

summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.600   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.300   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.700   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.666   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.100   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.900   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The mean age of the students in the class is about 25 years, with the youngest being 17 years and the oldest 55 years. The attitude variable was scaled 1-5, with a good attitude rated 5.
There were almost twice as many female students as male students in the course, and deep learning had the highest mean value among the three learning-approach variables. The lowest exam score attained was 7, with the class averaging about 22 points.

Graphical visualization of student attitude against exam points

library(ggplot2)
#initialize plot with data and aesthetic mapping
p1<- ggplot(students2014, aes( x= attitude, y= points, col = gender))
#define the visualization type (points)
p2<- p1 + geom_point()
#add a regression line
p3<- p2+geom_smooth(method = "lm")
#add a main title and draw the plot
p4<- p3 + ggtitle("student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'

From the plot above, we can deduce that the majority of students of both genders had an average attitude, which coincided with attaining average exam points. Female students with a negative attitude towards statistics managed to get higher exam points than male students with a negative attitude.

An advanced plot matrix with ggpairs()

#Access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
#create a more advanced plot matrix with ggpairs()
p<- ggpairs(students2014, mapping = aes (col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
#draw the plot
p

From the scatter plot matrix, for both genders the attitude of the student is highly correlated with the exam points attained. The plot also suggests that male students tend to score lower on deep learning than female students.

Regression model

A linear regression model is a statistical approach used for modelling relationships between a dependent variable \(Y\) and one or more explanatory variables \(X\). If there is only one explanatory variable it is called simple linear regression, while with more than one variable it is referred to as multiple linear regression.
A linear model is divided into two parts, namely the systematic and stochastic parts. The model follows the equation \[Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon \]

  • where \(Y\) is the target variable; we wish to predict the values of \(Y\) using the values of \(X\)
  • \(\alpha + \beta_1 X_1 + \dots + \beta_k X_k\) is the systematic part of the model
  • the \(\beta\) coefficients quantify the relationship between \(Y\) and \(X\)
  • \(\epsilon\) describes the errors, which are assumed to be normally distributed.
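The split into a systematic and a stochastic part can be illustrated with simulated data; the sketch below uses hypothetical values (not the course data) and recovers known coefficients with lm():

```r
# Simulate data following Y = alpha + beta1*X1 + beta2*X2 + epsilon
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(n, sd = 0.5)  # alpha = 2, beta1 = 3, beta2 = -1

fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates should land close to 2, 3 and -1
```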

Using the student dataset, the exam points serve as the target variable while attitude, strategic learning (stra) and age have been used as independent variables in the model. The three predictors were selected based on their correlation with the response variable.

my_model<- lm(points~ attitude+stra+age, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

The Call formula captures the model fitted: in this case it is a multiple linear regression with points as the target variable and attitude, strategic learning (stra) and age as the independent variables.

Residuals
Residuals are the differences between the actual observed response values (exam points) and the response values that the model predicted.
The residual section gives a five-number summary, used to assess how symmetrically the residuals are distributed around a mean of zero. From the output, the distribution of the residuals appears fairly symmetrical.
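The definition of a residual can be checked directly in R; a minimal sketch using the built-in cars data as a stand-in for the course data:

```r
# Residual = observed response - fitted (predicted) value
fit <- lm(dist ~ speed, data = cars)
res <- cars$dist - fitted(fit)
all.equal(as.numeric(res), as.numeric(residuals(fit)))  # TRUE
summary(residuals(fit))  # the summary points reported under 'Residuals'
```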

Coefficients
The coefficients are the estimated constants representing the intercept and the unknown parameters of the model. Under the coefficients section there are four columns.

  1. Estimate The Estimate column contains four rows; the first one is the intercept \(\alpha\), while the rest correspond to estimates of the unknown parameters \(\beta_1, \beta_2, \dots, \beta_k\) in our multiple linear regression model.

  2. Coefficient - Standard error
    The coefficient standard error measures the average amount that the coefficient estimate varies from the actual average value of the response variable (exam points). Ideally, this value should be small relative to its coefficient.

  3. Coefficient - t value The t-value measures how many standard errors our coefficient estimate is from 0. When the t-value is far from zero, we can reject the null hypothesis, that is, we declare that a relationship between the independent variable and the target variable exists. In our analysis, the t-statistic for attitude is relatively far from zero and large relative to its standard error, indicating a relationship, in contrast to the variables stra and age.

  4. Coefficient - Pr(>|t|) Commonly referred to as the p-value; a small p-value indicates that it is unlikely that the observed relationship between the predictor and the target variable is due to chance. Typically, a p-value of 5% or less is a good cut-off point. In our model, the p-values for the intercept and attitude are much closer to zero than those for stra and age.

The “Signif. codes” are associated with each estimate; three stars (asterisks) represent a highly significant p-value. In this model only attitude is significant at the 5% level (p \(\leqslant\) 0.05), so we can reject the null hypothesis for attitude and conclude that there is a relationship between exam points and student attitude, whereas for stra (p = 0.062) and age (p = 0.098) the evidence is only suggestive.

Residual standard error is a measure of the quality of a linear regression fit: it is the average amount that the response (exam points) deviates from the true regression line. In my_model the response deviated from the regression line by approximately 5.26 points on average. Note that the residual standard error was calculated with 162 degrees of freedom (DF), the number of observations minus the number of parameters estimated. In this analysis, we had 166 observations (students) and four estimated parameters (the intercept and three predictors), giving 166 - 4 = 162.

Multiple R-squared
The R-squared, also known as the coefficient of determination (\(R^2\)), gives the percentage of the variation in the dependent/response/target variable that the independent variables explain. In other words, it describes how well the model fits the data. \(R^2\) always lies between 0 and 1: values close to 0 indicate a regression that does not explain the variance in the response variable well, while values close to 1 indicate that most of the variance is explained. In my_model, \(R^2\) is 0.2182, meaning that only about 22% of the variance in the response variable (exam points) is explained by the predictors attitude, stra and age. Adjusted R-squared reports the same percentage after penalizing for the number of predictors in the model.
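The residual standard error and R-squared can both be recomputed from the residual sums of squares; a sketch on the built-in cars data (the same arithmetic applies to my_model):

```r
fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)                 # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
n   <- nrow(cars)
p   <- length(coef(fit))                     # number of estimated parameters

sqrt(rss / (n - p))  # residual standard error, with n - p degrees of freedom
1 - rss / tss        # multiple R-squared
```

These match summary(fit)$sigma and summary(fit)$r.squared.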

4. Diagnostic plots

Three diagnostic plots were produced from the model:

  1. Residuals vs Fitted
    This plot explores the validity of the model assumptions that are included in the expression \[ \epsilon \sim N(0,\sigma^2) \]
  • The errors are normally distributed
  • The errors are not correlated
  • The errors have constant variance \(\sigma^2\)
  • The size of a given error does not depend on the explanatory variables
  2. Normal Q-Q plot
    Explores the assumption that the errors of the model are normally distributed. When the majority of the residuals fall along the line, the assumption of normality is supported; severe deviation of the residuals from the line makes the assumption of normality questionable.

  3. Residuals vs Leverage plot
    Leverage measures how much impact or influence a single observation has on the model. Not all outliers are influential in regression analysis. Cases with high influence are usually located in the upper right or lower right corner, and cases in such spots can be influential against a regression line. When cases fall outside Cook's distance (that is, they have high Cook's distance scores), they are influential to the regression results, and the results would be altered if they were excluded.

par(mfrow= c(2,2))
plot(my_model, which = c(1,2,5))

In the Residuals vs Fitted plot, the residuals are distributed without any distinctive pattern, indicating that the linear relationship is captured by the model.

In the Normal Q-Q plot, most of the residual points fall along the diagonal line, with a little deviation at both ends. Since the deviation of the residuals from the line is minimal, the Q-Q plot supports the normality assumption of the model.

In the leverage plot there are no influential cases, as all the cases are well inside the Cook's distance lines.

date()
## [1] "Sun Dec 12 18:14:53 2021"

References
1. Rmarkdown for Scientist
2. University of Virginia Library


Logistic Regression Analysis

In this exercise, we explore the use of logistic regression analysis on student achievement data in secondary education in two Portuguese schools. The data attributes include the student grades and demographic, social and school-related features. The data were collected using school reports and questionnaires. More information can be found via the link.

The purpose of the analysis is to study the relationships between high/low alcohol consumption and student attributes. The joined dataset used in this exercise combines the two student alcohol consumption datasets. The following adjustments have been made:

Data exploration

1. Load and describe the student alcohol consumption data

#set the work directory
setwd("C:/LocalData/ocholla/IODS-project")
#Load the data
alc<- read.csv("Data/pormath.csv", header = TRUE, stringsAsFactors = TRUE)
#print column names
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"
#structure of the data
str(alc)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

The dataset contains 370 observations of 51 variables; the variables are a mixture of integer, factor and logical data types.

2. Analysis of student variables relationship with alcohol consumption.

Hypotheses on the four variables

  1. sex - male students consume more alcohol than female students
  2. studytime - low study time among the students increases the chances of high alcohol use
  3. famrel - a good-quality family relationship reduces alcohol use
  4. goout - students who often go out with friends are prone to peer pressure, which can easily lead to high alcohol consumption

Distribution of alcohol consumption by gender

library(ggplot2)
g1<- ggplot(data= alc, aes(x = high_use, fill= sex))
#define the plot as a bar plot and draw it
g1+ geom_bar()+ ggtitle("Alcohol consumption across students by gender")

A higher proportion of the male students indulge in high alcohol consumption compared to the female students, whereas low consumption is more common among the female students than among the male students.

Relationship between sex, quality family relationships and use of alcohol

#*summary statistics by sex and quality family relationships 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

#produce summary statistics by group
alc%>% group_by(sex, high_use)%>% summarise(count= n(), familytime=mean(famrel))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count familytime
##   <fct> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       3.92
## 2 F     TRUE        41       3.73
## 3 M     FALSE      105       4.13
## 4 M     TRUE        70       3.79

Four-fifths of the female students have low alcohol consumption, and they have a higher mean quality of family relationship than the female students with high alcohol consumption. Similarly, male students with good family relationships tended to have low alcohol consumption.

Relationship between sex, going out and alcohol consumption among the students

#*going out
alc%>% group_by(sex, high_use)%>% summarise(count= n(), going_out=mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count going_out
##   <fct> <lgl>    <int>     <dbl>
## 1 F     FALSE      154      2.95
## 2 F     TRUE        41      3.39
## 3 M     FALSE      105      2.70
## 4 M     TRUE        70      3.93

In both genders, higher incidences of going out are associated with high consumption of alcohol.

Does high use of alcohol have a connection to study time?

#does high use of alcohol have a connection to study time?
a2<- ggplot(alc, aes(x= high_use, y = studytime, col= sex))
#define the plot as a boxplot and draw it
a2 + geom_boxplot()+ylab("Hours")+ggtitle("Student study hours by alcohol consumption and sex")

Among students with low alcohol consumption, the female students spent between 5 and 10 hours on their studies, compared to the male students who spent between 3 and 5 hours.

On the contrary, female students with high alcohol consumption spent around 5 hours on their studies, while the male students spent 2 to 5 hours.

Logistic regression model

#find the model with glm()
m<- glm(high_use~sex+studytime+famrel+goout, data = alc, family = "binomial")
#print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + studytime + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5891  -0.7629  -0.4976   0.8304   2.6937  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2672     0.7312  -1.733  0.08309 .  
## sexM          0.7959     0.2669   2.982  0.00286 ** 
## studytime    -0.4814     0.1711  -2.814  0.00490 ** 
## famrel       -0.4193     0.1399  -2.996  0.00273 ** 
## goout         0.7873     0.1230   6.399 1.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.69  on 365  degrees of freedom
## AIC: 380.69
## 
## Number of Fisher Scoring iterations: 4

From the model, the p-values for all four variables are statistically significant. The output supports the hypotheses that gender, study time, going out and quality of family relationships affect the students' alcohol consumption. The positive estimate for the factor variable sex indicates a higher likelihood of high alcohol use among male students relative to female students.

Model coefficients

#Print out the coefficients of the model
coef(m)
## (Intercept)        sexM   studytime      famrel       goout 
##  -1.2672175   0.7959494  -0.4813712  -0.4192830   0.7872896

Based on the regression coefficients, the odds of high alcohol use increase for male students and for those with a high frequency of going out, while they decrease for students who allocate more study time and have excellent family relationships.

Odds ratio
The odds ratio is the ratio of expected "successes" to "failures". Higher odds correspond to a higher probability of success, with values ranging from zero to infinity.
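The connection between a logistic coefficient and the odds ratio can be sketched on simulated data (hypothetical values, not the student data): exponentiating a coefficient gives the factor by which the odds change per unit increase in that predictor.

```r
# Simulate a binary outcome whose log-odds depend linearly on x
set.seed(42)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-1 + 0.8 * x))  # true slope 0.8

fit <- glm(y ~ x, family = "binomial")
exp(coef(fit))["x"]  # odds ratio: roughly exp(0.8), i.e. about 2.2
# a one-unit increase in x multiplies the odds of y = 1 by this factor
```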

Transforming the coefficients by exponentiation turns them into odds ratios: the multiplicative change in the odds of the outcome (high_use) for a unit change in the predictor variable, holding all other predictors constant.

#compute odds ratios (OR)
OR<- coef(m)%>% exp
#compute confidence intervals (CI)
CI<- confint(m)%>% exp
## Waiting for profiling to be done...
#Print out the odds ratios with their CI
cbind(OR,CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2816141 0.06545486 1.1622100
## sexM        2.2165443 1.31841735 3.7630180
## studytime   0.6179355 0.43751933 0.8576353
## famrel      0.6575181 0.49791884 0.8636137
## goout       2.1974324 1.73873662 2.8198312

With 95% confidence intervals, the odds of high alcohol use increase by a factor of 2.22 for male students and 2.20 for students with a high tendency of going out. On the contrary, the odds of high alcohol use among students who allocate more study time and have good family relationships drop by about 38% and 34% respectively: every one-unit increase in study time reduces the odds of high alcohol use by about 38%, while a one-unit improvement in family relationship reduces them by about 34%.

Predictive power of the model
We explore the predictive power of the logistic model through a 2x2 cross-tabulation of the predictions versus the actual values.

#predict() the probability of high_use
probabilities <- predict(m, type = "response")
#add the predicted probabilities to alc
alc<- mutate(alc, probability= probabilities)
#use the probabilities to make a prediction of high_use
alc<- mutate(alc, prediction= probability>0.5)
#*see the first five original classes, predicted probabilities, 
select(alc, goout, famrel,studytime, sex, high_use,probability, prediction)%>% head(5)
##   goout famrel studytime sex high_use probability prediction
## 1     2      3         4   F    FALSE  0.05335420      FALSE
## 2     4      3         2   F     TRUE  0.41613728      FALSE
## 3     1      4         1   F    FALSE  0.06670564      FALSE
## 4     2      4         3   F    FALSE  0.05657850      FALSE
## 5     1      4         3   F     TRUE  0.02656662      FALSE

For the five students shown, the predicted probabilities of high alcohol consumption are all below 0.5, so the model predicts FALSE for each; note that two of them (rows 2 and 5) are actual high users and are therefore misclassified.

Confusion matrix (2x2) table

#tabulate the target variable versus the predictions 

table(high_use=alc$high_use, prediction=alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   230   29
##    TRUE     59   52

From the table, 230 low-consumption observations were correctly predicted while 29 were incorrectly predicted as high. On the other hand, 52 high-consumption observations were correctly predicted while 59 were wrongly predicted as low.

Plotting the original high_use versus the predicted probabilities in alc

#initialize a plot of 'high_use' versus 'probability' in alc
g<- ggplot(alc, aes(x = probability, y= high_use, col = prediction))
#define the geom as points and draw the plot
g+geom_point()

Accuracy assessment through the confusion matrix

#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>% prop.table%>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.62162162 0.07837838 0.70000000
##    TRUE  0.15945946 0.14054054 0.30000000
##    Sum   0.78108108 0.21891892 1.00000000

Of the cases predicted FALSE, about 20% were actually high users, and of the cases predicted TRUE, about 36% were actually low users; overall, about 24% of the observations were misclassified.
From the confusion matrix table, the accuracy of the prediction is calculated by adding the true negatives and true positives (0.62 and 0.14) and dividing by the sum over all cells. The prediction accuracy is about 0.76, a clear improvement over simple guessing strategies.
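The accuracy arithmetic can be verified directly from the counts in the 2x2 table above:

```r
# Confusion-matrix counts taken from the table above
TN <- 230; FP <- 29   # actual low use:  correctly / incorrectly predicted
FN <- 59;  TP <- 52   # actual high use: incorrectly / correctly predicted

accuracy <- (TN + TP) / (TN + FP + FN + TP)  # correct predictions / all cases
round(c(accuracy = accuracy, error = 1 - accuracy), 3)  # 0.762 and 0.238
```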

Cross validation

This is a method of testing a predictive model on unseen data: the data is split into training and testing sets, where the training data is used to fit the model while the test data is used to make predictions and evaluate the model's performance.

One round of cross validation involves

  1. partitioning a sample of data into complementary subsets
  2. Performing the analysis on one subset (the training set, larger)
  3. validating the analysis on the other subset (the testing set, smaller)

This process is repeated so that eventually all of the data is used for both training and testing. In CV, the value of a penalty (loss) function (here the mean prediction error) is computed on data not used for fitting the model. A low CV value is an indication of a good model.
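One such round can be sketched manually with a train/test split; an illustration on the built-in mtcars data (the analysis in this section uses cv.glm() to automate the repetition):

```r
# One round of cross-validation: fit on a training subset,
# evaluate the loss on the held-out testing subset
set.seed(7)
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- glm(am ~ wt + hp, data = train, family = "binomial")
prob <- predict(fit, newdata = test, type = "response")
mean((prob > 0.5) != test$am)  # mean prediction error on unseen data
```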

Cross validation using 10 K-folds

library(boot)
#define a loss function (average prediction error)
loss_func<- function(class, prob){
  n_wrong<-abs(class - prob)> 0.5
  mean(n_wrong)
}
#average number of wrong prediction
loss_func(class = alc$high_use, prob = alc$probability) 
## [1] 0.2378378
cv<- cv.glm(data= alc, cost= loss_func, glmfit= m, K=10)
#average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2459459

The model had an average of about 0.24 wrong predictions on the training data and a 10-fold cross-validation error of about 0.25, which is somewhat better than the error of the model used in the DataCamp exercise.

Super bonus

Perform cross-validation to compare the performance of different logistic regression models (i.e. different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with fewer predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.

#find the model with glm()
summary(m0<- glm(high_use~G3+absences+traveltime+reason+famsize+sex+studytime+famrel+goout, data = alc, family = "binomial"))
## 
## Call:
## glm(formula = high_use ~ G3 + absences + traveltime + reason + 
##     famsize + sex + studytime + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8894  -0.7459  -0.4418   0.6426   2.7811  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.66560    1.00389  -2.655  0.00792 ** 
## G3               -0.01447    0.04251  -0.340  0.73350    
## absences          0.07678    0.02339   3.282  0.00103 ** 
## traveltime        0.40320    0.19364   2.082  0.03732 *  
## reasonhome        0.34770    0.34200   1.017  0.30930    
## reasonother       1.11530    0.46975   2.374  0.01759 *  
## reasonreputation -0.16462    0.36307  -0.453  0.65025    
## famsizeLE3        0.23888    0.29336   0.814  0.41549    
## sexM              0.87539    0.28243   3.100  0.00194 ** 
## studytime        -0.30531    0.18219  -1.676  0.09379 .  
## famrel           -0.42737    0.14738  -2.900  0.00373 ** 
## goout             0.78725    0.13037   6.039 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 346.82  on 358  degrees of freedom
## AIC: 370.82
## 
## Number of Fisher Scoring iterations: 5
summary(m1<- glm(high_use~absences+traveltime+reason+sex+studytime+famrel+goout, data = alc, family = "binomial"))
## 
## Call:
## glm(formula = high_use ~ absences + traveltime + reason + sex + 
##     studytime + famrel + goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9077  -0.7427  -0.4506   0.6514   2.7420  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.74054    0.85222  -3.216 0.001301 ** 
## absences          0.07805    0.02326   3.356 0.000791 ***
## traveltime        0.41779    0.19046   2.194 0.028263 *  
## reasonhome        0.32961    0.34063   0.968 0.333226    
## reasonother       1.09017    0.46850   2.327 0.019970 *  
## reasonreputation -0.18187    0.36148  -0.503 0.614869    
## sexM              0.89451    0.28156   3.177 0.001488 ** 
## studytime        -0.32030    0.18031  -1.776 0.075671 .  
## famrel           -0.43481    0.14668  -2.964 0.003033 ** 
## goout             0.79177    0.12892   6.142 8.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 347.54  on 360  degrees of freedom
## AIC: 367.54
## 
## Number of Fisher Scoring iterations: 5
summary(m2<- glm(high_use~absences+traveltime+sex+studytime+famrel+goout, data = alc, family = "binomial"))
## 
## Call:
## glm(formula = high_use ~ absences + traveltime + sex + studytime + 
##     famrel + goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7951  -0.7373  -0.4716   0.6918   2.5876  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.35248    0.80844  -2.910  0.00362 ** 
## absences     0.07725    0.02257   3.423  0.00062 ***
## traveltime   0.37496    0.18585   2.018  0.04364 *  
## sexM         0.89229    0.27716   3.219  0.00128 ** 
## studytime   -0.38722    0.17561  -2.205  0.02745 *  
## famrel      -0.41690    0.14401  -2.895  0.00379 ** 
## goout        0.76216    0.12560   6.068 1.29e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 355.05  on 363  degrees of freedom
## AIC: 369.05
## 
## Number of Fisher Scoring iterations: 4
date()
## [1] "Sun Dec 12 18:14:56 2021"

Clustering

This exercise utilizes the Boston data found in the MASS package. It includes 506 rows and 14 columns describing housing values in the suburbs of Boston. The variables are listed in the structure output below.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.92 loaded
library(tidyr)
library(ggplot2)

Data exploration

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston data set has 506 observations and 14 variables, stored as either integer or numeric data types.

1.1 Summary statistics

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The summary shows the descriptive statistics of each variable in the Boston data. The variable tax has the largest range between its minimum and maximum values, while chas has the smallest.

1.2 Correlation analysis.

1.2.1 plots between the variables.
#plot matrix of the variables
pairs(Boston[1:6])

pairs(Boston[7:14])

1.2.2 Correlation matrix
#between variables in the data
cor_matrix<- cor(Boston)%>% round(digits = 2)
#print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
1.2.3 Visualization of the correlation matrix
corrplot(cor_matrix, method="circle",type= "upper",
         cl.pos="b", tl.pos="d",tl.cex=0.6)

From the correlation plots and matrix, the variables tax and rad are highly positively correlated (r = 0.91). This is also visible in the correlation plot as the largest circle. The pairs medv and lstat, dis and age, and dis and nox show strong negative correlations.
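As an aside, pairs above a given correlation threshold can be pulled out programmatically rather than read off the plot. A minimal sketch, using the built-in mtcars data as a stand-in (the 0.8 threshold and column choice are illustrative):

```r
# sketch: list variable pairs with |r| above a threshold,
# using mtcars as a stand-in dataset
cm <- round(cor(mtcars[, c("mpg", "disp", "hp", "wt")]), 2)

# upper.tri() keeps each pair once; arr.ind gives row/column positions
idx <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           r    = cm[idx])
```

The same pattern applied to cor_matrix above would return the tax–rad pair, among others.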

Data analysis.

2.1 Standardizing the data set

Standardizing is done through scaling: each column mean is subtracted from the corresponding column, and the difference is divided by the column's standard deviation.
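The scale() call below applies exactly this formula column by column; a quick toy check on a made-up vector (not the Boston data) confirms it:

```r
# manual standardization vs. scale() on a toy vector
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
scaled <- as.numeric(scale(x))

all.equal(manual, scaled)  # TRUE: scale() is (x - mean) / sd
round(mean(manual), 10)    # 0: standardized data are centred at zero
```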

#* center and standardize variables
boston_scaled<- scale(Boston)
#summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
  • All the variables have been standardized to a mean of zero, in contrast to the original data, where values ranged from roughly 0.006 to 711 and the means varied widely.

2.2 Creation of a new a categorical variable

Create a new categorical variable of the crime rate in the Boston data set from the scaled crime rate.

#class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
#change the object to data frame
boston_scaled<- as.data.frame(boston_scaled)
#summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110

The scaled crime rate ranges from -0.419 (min) to 9.924 (max), with a mean of 0.

2.3 Create break points to be used in the categorical variable

#create a quantile vector of crim and print it
bins<- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

The quantile vector divides the data into quartiles, which serve as the break points for the categorical variable.

2.4 Creating break points in the categorical variable

crime<- cut(boston_scaled$crim, breaks= bins, include.lowest= TRUE, labels=c("low","med_low","med_high","high"))

#look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

The low and high crime-rate categories had 127 observations each, while med_low and med_high had 126 each.

2.5 Remove the old variable and add the new variable to the dataset

#remove the old crime variable from dataset
boston_scaled<- dplyr::select(boston_scaled, -crim)
#add the new categorical value to scaled data
boston_scaled<- data.frame(boston_scaled, crime)

2.6 Creation of training and test data sets from the dataframe.

Divide the dataset into a train set (80%) and a test set (20%), then remove the crime variable from the test set while saving the correct classes.

#number of rows in the Boston dataset
n<-nrow(boston_scaled)
#choose randomly 80% of the rows
ind<- sample(n, size=n * 0.8)
#create train set
train<- boston_scaled[ind,]
#create test set
test<- boston_scaled[-ind,]
#save the correct classes from test data
correct_classes<- test$crime
#remove the crime variable from test data
test<- dplyr::select(test, -crime)

Linear Discriminant Analysis

Linear Discriminant Analysis is a classification (and dimension reduction) method: it finds the linear combinations of the variables that best separate the classes of the target variable. Here we use the categorical crime rate as the target variable and all the other variables in the data set as predictors.

#*Fit the linear discriminant analysis on the train set.
lda.fit<- lda(crime~., data= train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2450495 0.2450495 0.2500000 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       0.9211717 -0.9281235 -0.084848104 -0.8754701  0.4316838 -0.8971909
## med_low  -0.1331790 -0.3044592  0.006051757 -0.5531705 -0.1106944 -0.3238968
## med_high -0.3642391  0.1852231  0.085589136  0.3727537  0.1065597  0.4373160
## high     -0.4872402  1.0149946 -0.038441925  1.0478615 -0.3623662  0.8010225
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8687318 -0.6767071 -0.7309907 -0.43388765  0.38026618 -0.76812862
## med_low   0.3317006 -0.5445257 -0.5131189 -0.09890293  0.31612307 -0.12808732
## med_high -0.3681799 -0.4319991 -0.3285838 -0.34385276  0.09463739  0.05903649
## high     -0.8660074  1.6596029  1.5294129  0.80577843 -0.84036775  0.85474245
##                 medv
## low       0.51771507
## med_low   0.01543324
## med_high  0.20785218
## high     -0.65917589
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.116268763  0.65170579 -0.95028163
## indus    0.089460920 -0.38133442  0.25877406
## chas    -0.002372005  0.06208568  0.16295949
## nox      0.297101818 -0.69784887 -1.47738446
## rm       0.115567472 -0.11206313 -0.16649439
## age      0.261080132 -0.33160585 -0.09973337
## dis     -0.098035638 -0.23184126  0.02491327
## rad      3.618853029  0.99798693  0.16311167
## tax      0.068890706 -0.03513625  0.35081916
## ptratio  0.159583654  0.08252759 -0.31603643
## black   -0.133652450  0.01622058  0.13747471
## lstat    0.136662468 -0.31509918  0.30829475
## medv     0.022297643 -0.43023296 -0.26452202
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9572 0.0332 0.0096

LDA determines the group means and computes, for each observation, the probability of belonging to each group of the target variable. Each observation is then assigned to the group with the highest probability.
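This assignment rule can be verified on any fitted LDA model; a sketch using the built-in iris data (purely illustrative, not the crime data):

```r
library(MASS)

# fit LDA on iris and inspect the per-class posterior probabilities
fit <- lda(Species ~ ., data = iris)
post <- predict(fit)$posterior

# each observation is assigned to the class with the highest posterior;
# this reproduces the class labels returned by predict()
pred <- colnames(post)[max.col(post)]
identical(pred, as.character(predict(fit)$class))  # TRUE
```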

Interpreting the results
Prior probabilities of groups: the proportion of training observations in each group. Here:

  • 25.99% of training observations are in the low category
  • 24.50% in the med_low category
  • 24.50% in the med_high category
  • 25.00% in the high category

Group means: the group centres of gravity, i.e. the mean of each variable within each group.
Coefficients of linear discriminants: the linear combinations of predictor variables used to form the LDA decision rule.
Proportion of trace: the share of between-group variability explained by each linear discriminant, with LD1 the highest at 95.72% and LD3 the lowest at 0.96%.

3.1 Draw the LDA (bi)plot

The LDA biplot is designed to show how individuals and groups differ.

#create function for lda biplot arrows
lda.arrows<- function(x, myscale=1, arrow_heads=0.1, color="orange",
                      tex= 0.75, choices=c(1,2)){
  heads<- coef(x)
  arrows(x0 =0, y0=0,
         x1= myscale * heads[,choices[1]],
         y1= myscale* heads[,choices[2]], col = color, length=
           arrow_heads)
  text(myscale*heads[,choices], labels=row.names(heads),
                     cex= tex, col=color, pos=3)
}
#target classes as numeric 
classes<- as.numeric(train$crime)
#plot the lda results
plot(lda.fit, dimen=2, col=classes, pch= classes)
lda.arrows(lda.fit, myscale=1)

In an LDA biplot, the arrows represent the variables; longer arrows indicate larger discriminant coefficients and therefore more discriminating power. In the plot above, the rad variable has by far the longest arrow, meaning it is the most influential variable in the model.

In addition, the variables nox and rad are not strongly correlated, as the angle between their arrows is close to a right angle, while nox and zn point in roughly opposite directions, indicating a negative correlation.

3.2 Predict a class with the LDA model on the test data

lda.pred<- predict(lda.fit, newdata=test)
#*  cross tabulate the results with the crime categories from the test set.
table(correct= correct_classes, predicted= lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       4        2    0
##   med_low    6      15        6    0
##   med_high   1      10       14    2
##   high       0       0        1   25

The model predicted the high category most accurately (25 of 26 correct), followed by low (16 of 22). The middle categories were harder to separate: med_low (15 of 27) and med_high (14 of 27) were frequently misclassified into neighbouring categories.
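The overall accuracy implied by such a cross-tabulation is the share of diagonal entries; a small sketch using the counts printed above (note these change with every random train/test split):

```r
# confusion matrix from the output above (rows = correct, cols = predicted)
cm <- matrix(c(16,  4,  2,  0,
                6, 15,  6,  0,
                1, 10, 14,  2,
                0,  0,  1, 25),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))

# accuracy = correctly classified / total observations
sum(diag(cm)) / sum(cm)  # about 0.69 for this split
```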

K-Means clustering

Reload the Boston dataset and standardize

library(MASS)
data("Boston")
#center and standardize variables
Boston_scaled<- scale(Boston)
#summaries of the scaled variables
summary(Boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...

From the summary, all the variables have a mean of zero after standardization.

4.1 Calculate the distances between the observations.

4.1.1 Euclidean distance matrix
#euclidean distance matrix
dist_eu<- dist(Boston_scaled)
#summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
4.1.2 Manhattan distance matrix
#manhattan distance matrix
dist_man<- dist(Boston_scaled, method = 'manhattan')
#summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
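The difference between the two metrics is easiest to see on a single pair of points; a toy check:

```r
# euclidean vs. manhattan distance for one pair of points
p <- c(0, 0)
q <- c(3, 4)
sqrt(sum((p - q)^2))  # euclidean: 5
sum(abs(p - q))       # manhattan: 7

# dist() reproduces both
as.numeric(dist(rbind(p, q)))                        # 5
as.numeric(dist(rbind(p, q), method = "manhattan"))  # 7
```

Because Manhattan distance sums the absolute coordinate differences, it is always at least as large as the Euclidean distance, which matches the larger summary values for dist_man above.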

4.2 k-means analysis

Implement k-means clustering with an arbitrary number of clusters (here three).

#K-means clustering
km<- kmeans(Boston_scaled, centers = 3)
#Plot the Boston_scaled dataset with clusters
pairs(Boston_scaled, col= km$cluster)

All the variable pairs have been partitioned into three clusters, shown in different colours.

4.3 Determine the optimal number of k means

To determine the optimal number of clusters, we look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. When the number of clusters is plotted against the total WCSS, the optimal number is where the total WCSS drops sharply.

#set set.seed function to avoid the random assigning of initial clusters centers
set.seed(123) #to avoid random
#determine the number of clusters
k_max<-10
#calculate the total within sum of squares
twcss<- sapply(1:k_max, function(k)
  {kmeans(Boston_scaled, k)$tot.withinss})
#visualize the results
qplot(x = 1:k_max, y= twcss, geom= 'line')

From the plot, the sharp drop in total WCSS occurs at two clusters.
This will serve as the optimal number of clusters for k-means.

Implement k-means with the optimal number of centers (2).

#k-means clustering
km<- kmeans(Boston_scaled, centers= 2)
#plot the Boston dataset with clusters, vary the number of pairs
pairs(Boston_scaled,col = km$cluster)

From the plot, the two clusters are clearly distinct from each other, reflecting the k-means objective of minimising within-cluster variation, which tends to produce well-separated clusters.

Bonus

  • Perform k-means on the original Boston data with some reasonable number of clusters (>2). Remember to standardize the dataset.
  • Then perform LDA using the clusters as target classes. Include all the variables in the boston data in the LDA model.
  • Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution).
  • Interpret the results. Which variables are the most influential linear separators for the clusters?
1a. Load and standardize the original dataset.
#Reload the Boston datset and standardize the dataset
library(MASS)
data("Boston")

#center and standardize variables
boston_scaled2<- scale(Boston)
#summaries of the scaled variables
summary(boston_scaled2)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling, all the variables have a mean of zero.

1b. Perform K-means using some reasonable number of clusters (>2).
#k-means clustering
km<- kmeans(boston_scaled2, centers= 5)
#plot the Boston dataset with clusters, vary the number of pairs
pairs(boston_scaled2,col = km$cluster)

The pairwise plots now show five clusters.

1c. Add the clusters to the scaled Boston dataset
#add the clusters into the boston dataset
boston_scaled2<- data.frame(boston_scaled2, km$cluster)
1d. Create training and test datasets from the scaled Boston data
#number of rows in the Boston dataset
n<-nrow(boston_scaled2)
#choose randomly 80% of the rows
ind<- sample(n, size=n * 0.8)
#create train set
train2<- boston_scaled2[ind,]
#create test set
test2<- boston_scaled2[-ind,]

The original data was divided into 80% training and 20% testing sets.

1e. Perform LDA using the clusters as target classes. Include all the variables of the Boston data in the LDA model.
#*Fit the linear discriminant analysis on the train set.
lda.fit2<- lda(km.cluster~., data= train2)
# print the lda.fit object
lda.fit2
## Call:
## lda(km.cluster ~ ., data = train2)
## 
## Prior probabilities of groups:
##          1          2          3          4          5 
## 0.07425743 0.13613861 0.23267327 0.20792079 0.34900990 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm        age
## 1 -0.1854493 -0.3157316  0.3133786  3.6647712  0.4306404  0.2138181  0.4290913
## 2 -0.4138882  2.2993839 -1.1620998 -0.2007454 -1.1957952  0.6874068 -1.4644349
## 3  1.2068757 -0.4872402  1.0299433 -0.2723291  0.9969853 -0.4856256  0.7840759
## 4 -0.3322134 -0.4808597  0.5005827 -0.2723291  0.4126493 -0.5353672  0.6272386
## 5 -0.3982818 -0.1104688 -0.6999989 -0.2723291 -0.6224712  0.3392337 -0.4783739
##          dis          rad        tax    ptratio       black      lstat
## 1 -0.4433830  0.009638649 -0.0682569 -0.4290486  0.14108828 -0.1081627
## 2  1.6196570 -0.679093526 -0.5526750 -0.8218082  0.34273326 -0.9422548
## 3 -0.8263352  1.635167428  1.5322534  0.8052870 -0.80749211  0.9509017
## 4 -0.5482077 -0.585376631 -0.1979161  0.1772572  0.05114137  0.4788749
## 5  0.4272987 -0.554250499 -0.7413582 -0.3548817  0.37017778 -0.5635095
##         medv
## 1  0.5694394
## 2  0.7486460
## 3 -0.8219329
## 4 -0.4545371
## 5  0.4426882
## 
## Coefficients of linear discriminants:
##                  LD1          LD2           LD3         LD4
## crim     0.034626545  0.011867261 -0.1559040056  0.11802273
## zn       0.297786175 -0.287582562 -1.4864689354 -1.24084138
## indus   -0.054367812  0.370023831  0.1632260886 -0.40109753
## chas    -5.190618709 -0.398296510 -0.2982405218 -0.03794020
## nox      0.005563069 -0.287697635  0.0873598437 -0.49971441
## rm       0.055060927  0.015978997 -0.1245985638  0.09946824
## age     -0.123498535  0.189920047  0.4227915007 -0.28603098
## dis     -0.006874898 -0.462773844 -0.4034456808  0.33518750
## rad     -0.304219079  2.649304175 -1.0824093907  2.02998628
## tax     -0.017448586 -0.004645183 -0.6445853812 -1.10990584
## ptratio -0.016550862  0.029361089  0.1926290390 -0.61540473
## black   -0.002242608 -0.096184981  0.0315333992  0.09825891
## lstat    0.082731274  0.294123773 -0.0028274113 -0.26817140
## medv     0.146425853 -0.308358740 -0.0006178331 -0.03878442
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.6388 0.2588 0.0818 0.0206

Interpreting the results
Prior probabilities of groups: the proportion of training observations in each group. Here:

  • 7.42% of training observations are in cluster 1
  • 13.61% in cluster 2
  • 23.26% in cluster 3
  • 20.79% in cluster 4
  • 34.90% in cluster 5

Group means: the group centres of gravity, i.e. the mean of each variable within each group.
Coefficients of linear discriminants: the linear combinations of predictor variables used to form the LDA decision rule.
Proportion of trace: the share of between-group variability explained by each linear discriminant:

  • LD1 explains the most, 63.88%
  • LD2 25.88%
  • LD3 8.18%
  • LD4 2.06%

Draw the LDA (bi)plot

#create function for lda biplot arrows
lda.arrows<- function(x, myscale=1, arrow_heads=0.1, color="orange",
                      tex= 0.75, choices=c(1,2)){
  heads<- coef(x)
  arrows(x0 =0, y0=0,
         x1= myscale * heads[,choices[1]],
         y1= myscale* heads[,choices[2]], col = color, length=
           arrow_heads)
  text(myscale*heads[,choices], labels=row.names(heads),
       cex= tex, col=color, pos=3)
}
#target classes as numeric 
classes<- as.numeric(train2$km.cluster)
#plot the lda results
plot(lda.fit2, dimen=2, col=classes, pch= classes)
lda.arrows(lda.fit2, myscale=1)

From the plot, the chas variable has the largest discriminant coefficient, represented by the longest arrow, and is the most influential variable, followed by rad. Moreover, chas and rad are not strongly correlated, as the angle between their arrows is nearly a right angle.

Super-Bonus

model_predictors<- dplyr::select(train,-crime)
#check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
#matrix multiplication
matrix_product<- as.matrix(model_predictors)%*%
  lda.fit$scaling
matrix_product<-as.data.frame(matrix_product)

Access the plotly package to create a 3D plot of the columns of the matrix product.

  • Adjust the code: add color as an argument in the plot_ly() function.
  • Set the color to the crime classes of the train set.
  • Draw another 3D plot where the color is defined by the clusters of the k-means.
  • How do the plots differ? Are there any similarities?
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x= matrix_product$LD1, y= matrix_product$LD2, 
        z= matrix_product$LD3, type='scatter3d', mode= 'markers')
date()
## [1] "Sun Dec 12 18:15:16 2021"

Dimensionality reduction techniques

The human dataset combines several indicators from the UNDP for most countries in the world. The variables fall under two categories: health and knowledge, and empowerment.

Load the data and libraries

#set the workdirectory
setwd("C:/LocalData/ocholla/IODS-project/Data")
library(dplyr)
library(corrplot)
library(ggplot2)
#Load the human data 
human<- read.csv("human.csv", header = TRUE)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The dataset contains 155 observations of 8 variables, namely:

  • “GNI” = Gross National Income per capita
  • “Life.Exp” = Life expectancy at birth
  • “Edu.Exp” = Expected years of schooling
  • “Mat.Mor” = Maternal mortality ratio
  • “Ado.Birth” = Adolescent birth rate
  • “Parli.F” = Percentage of female representatives in parliament
  • “Edu2.FM” = Ratio of females (Edu2.F) to males (Edu2.M) with at least secondary education
  • “Labo.FM” = Ratio of females (Labo2.F) to males (Labo2.M) in the labour force

Data exploration and visualization

2.1 Summary of the data variables

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI           Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :  1.0   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.: 39.5   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 78.0   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 78.0   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.:116.5   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :155.0   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The variables are on very different scales: for example, the mean of Mat.Mor is 149.1, while the ratio variables Edu2.FM and Labo.FM stay below 1.

2.2 Matrix of the variables

#plot matrix of the variables
library(GGally)
library(dplyr)
library(corrplot)

#visualize the human variables
ggpairs(human)

2.3 Correlation matrix

#create a correlation matrix to show the correlation 
#between variables in the data
cor_matrix<- cor(human) %>% round(digits = 2)
#print the correlation matrix
cor_matrix
##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.18   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.01    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.12   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.19   -0.86     -0.73    0.17
## GNI          0.18   -0.01    0.12     0.19  1.00   -0.10     -0.14    0.01
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.10    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.14    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.01   -0.09     -0.07    1.00

From the matrix,

  • Life.Exp and Mat.Mor are highly negatively correlated (corr = -0.86)
  • Edu.Exp and Life.Exp are strongly positively correlated (corr = 0.79)

2.4 Visualize the correlation matrix

#visualize the correlation matrix
corrplot(cor_matrix, method="circle",type= "upper",
         cl.pos="b", tl.pos="d",tl.cex=0.6)

From the visualization plot,

  • The relationship between Life.Exp and Mat.Mor is represented by a big dark red circle indicating a strong negative correlation between the two variables.
  • Life.Exp and Edu.Exp tend to have a positive correlation.
  • The Labo.FM, GNI and Parli.F variables have the weakest correlations with the other variables.

3.0 Performing Principal Component Analysis

Principal component analysis (PCA) is used to summarize and visualize the information in a data set containing observations described by multiple inter-correlated quantitative variables.

PCA is used to extract the important information from a multivariate data table and to express this information as a set of a few new variables called principal components. These new variables are linear combinations of the originals. The number of principal components is less than or equal to the number of original variables.
PCA has the following properties:

  • The 1st principal component captures the maximum amount of variance from the features in the original data.
  • The 2nd principal component is orthogonal to the first and it captures the maximum amount of variability left.
  • The same is true for each principal component. They are all uncorrelated and each is less important than the previous one, in terms of captured variance.

Main purpose of principal component analysis is to:

  • Reduce the dimensionality of the data to two or three principal components that can be visualized graphically, with minimal loss of information.
  • Identify correlated variables.
  • Identify hidden patterns in a data set.
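These properties can be illustrated on a small simulated data set (a sketch; the toy variables x and y are made up for the demonstration):

```r
# Tiny illustration of the PCA properties above on simulated data
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- toy$x + rnorm(100, sd = 0.3)   # y strongly correlated with x
pca_toy <- prcomp(toy, scale. = TRUE)

# Component variances decrease: PC1 captures most of the variance
pca_toy$sdev^2

# The principal component scores are uncorrelated (off-diagonals ~ 0)
round(cor(pca_toy$x), 10)
```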

3.1.PCA on non-standardized human data

pca_human<- prcomp(human)
#print summary of the pca_human
s<-summary(pca_human)
s
## Importance of components:
##                             PC1      PC2      PC3     PC4     PC5     PC6
## Standard deviation     214.2937 44.75162 26.34667 11.4791 4.06656 1.60664
## Proportion of Variance   0.9416  0.04106  0.01423  0.0027 0.00034 0.00005
## Cumulative Proportion    0.9416  0.98267  0.99690  0.9996 0.99995 1.00000
##                           PC7    PC8
## Standard deviation     0.1905 0.1587
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000

PC1 has the highest variance at 0.9416 while PC8 has the least at essentially 0.
PC1, PC2 and PC3 together account for 99.69% of the cumulative variability. The variance captured decreases as the component number increases.

3.2. Show the variability captured by the principal components

#round the percentage of variance captured by each PC
pca_pr<-round(100*s$importance[2,],
              digits = 1)
#print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 94.2  4.1  1.4  0.3  0.0  0.0  0.0  0.0

The percentages of variance captured by each of the components.

3.3. Draw a biplot displaying the observations by the first two principal components

#create object pc_lab to be used as axis labels
pc_lab<- paste0(names(pca_pr), "(",pca_pr,"%)")

biplot(pca_human, choices = 1:2, cex = c(0.8,1), col = c("grey40","deeppink2"),
       xlab=pc_lab[1],ylab= pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

From the principal component analysis, using the first two principal components, PC1 captures 94.2% of the variance while PC2 captures 4.1%. This means that the majority of the variance of the original features is contained in PC1.
Maternal mortality (Mat.Mor) is the most influential variable in PC1, whereas in PC2 it is GNI; the two variables are not correlated, as the angle between their arrows is close to a right angle. Maternal mortality is also the most influential variable in the data set overall, having the highest standard deviation, as shown by its longest arrow.

3.4 Standardize the variables in the human data and perform PCA

human_std<- scale(human)
#check the summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-1.7154   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.8577   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median : 0.0000   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8577   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 1.7154   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

The means of all the variables have been scaled to zero.
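The standardization can also be verified numerically; a quick check that scale() has given every column mean 0 and standard deviation 1:

```r
# Quick check: after scale(), every column has mean 0 and sd 1
round(colMeans(human_std), 10)
apply(human_std, 2, sd)
```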

3.5 Perform principal component analysis (with SVD method)

pca_human_std<-prcomp(human_std)
#print summary of the pca_human
s1<-summary(pca_human_std)
s1
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.9659 1.1387 0.9896 0.8662 0.69949 0.54002 0.46700
## Proportion of Variance 0.4831 0.1621 0.1224 0.0938 0.06116 0.03645 0.02726
## Cumulative Proportion  0.4831 0.6452 0.7676 0.8614 0.92254 0.95899 0.98625
##                            PC8
## Standard deviation     0.33165
## Proportion of Variance 0.01375
## Cumulative Proportion  1.00000

PC1 has the highest variance at 0.48 compared to PC8 at 0.014.

3.6. Show the variability captured by the principal components

#round the percentage of variance captured by each PC
pca_pr1<-round(100*s1$importance[2,],
              digits = 1)
#print out the percentages of variance
pca_pr1
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 48.3 16.2 12.2  9.4  6.1  3.6  2.7  1.4

Percentages of variance of the components: PC1 captures the most at 48.3% while PC8 captures the least at 1.4%.

3.7. Draw biplot of the principal component representation and the original variables

#create object pc_lab to be used as axis labels
pc_lab1<- paste0(names(pca_pr1), "(",pca_pr1,"%)")

#draw biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8,1), col = c("grey40","deeppink2"),
       xlab=pc_lab1[1],ylab= pc_lab1[2])

From the plot:

  • Using the first two PCs, PC1 has a variance of 48.3% while PC2 has a variance of 16.2%.
  • Most of the variables seem to have roughly equal influence, judging by the similar lengths of their arrows.
  • In PC1, Mat.Mor and Ado.Birth are positively correlated, as the angle between them is small. However, Mat.Mor is more influential than Ado.Birth, based on the length of its arrow.
  • Edu.Exp, Edu2.FM and Life.Exp are negatively correlated with Mat.Mor and Ado.Birth in PC1.
  • In PC2, Parli.F and Labo.FM are the influential variables.

3.8 Comparison between the standardized and non-standardized human data PCA results, with a personal interpretation of the first two principal component dimensions based on the biplots

  • Are the results different? The results are different.

  • Why or why not? Non-standardized data show much higher variance than standardized data, because PCA on non-standardized data is dominated by the variables with the largest scales and is sensitive to noise and outliers. This is demonstrated by the single long arrow for Mat.Mor in the non-standardized biplot. Standardized variables, on the other hand, have a mean of zero and a standard deviation of one, which removes the effect of differing scales so that the variables have roughly equal influence. This can be seen in the standardized biplot, where the arrows are of similar length across the components.

4. Multiple Correspondence Analysis (MCA)

MCA is a dimensionality reduction method used to analyse the patterns of relationships between several categorical variables.
The goal of MCA is to identify associations between variable categories.
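Under the hood, MCA operates on the complete disjunctive (indicator) table, in which each factor level becomes its own 0/1 column. A small illustration of that coding (the toy data frame here is made up; tab.disjonctif() is the FactoMineR helper that builds the indicator matrix):

```r
# Each factor level becomes a 0/1 indicator column
df <- data.frame(sugar = factor(c("sugar", "No.sugar", "sugar")),
                 lunch = factor(c("lunch", "Not.lunch", "Not.lunch")))
FactoMineR::tab.disjonctif(df)
```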

Using dataset tea from the FactoMineR package.

The tea data consist of a survey of tea consumers about their consumption of tea. The questions cover how they consume tea, what they think of tea, and descriptive questions (sex, age, socio-professional category and sport practice).

4.1 Load the data

#Load the tea dataset from the FactoMineR package
library(FactoMineR)
library(ggplot2)
library(tidyr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
data("tea")
#*Look at the structure and dimensions of the data 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

The dataset contains 300 observations of 36 variables; except for age, all variables are categorical.

4.2 Select specific columns (variables)

#columns to keep in the dataset
keep_columns<- c("Tea","How","how","sugar","where","lunch")
#select the keep_columns to create a new dataset
tea_time<- dplyr::select(tea, one_of(keep_columns))
#look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...

The selected columns have two to four levels of categorization.

4.3 Visualizing the data set

Plot the frequency of the variable categories

gather(tea_time)%>% ggplot(aes(value))+
  facet_wrap("key", scales= "free")+
  geom_bar()+theme(axis.text.x = 
                     element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

From the plots:

  • how: More than 150 of the consumers preferred to use a tea bag.
  • How: More than half of the consumers preferred to take their tea alone, without additions such as milk or lemon.
  • lunch: Most tea consumers preferred to consume tea at times other than lunch.
  • sugar: The number of consumers who preferred no sugar in their tea was slightly higher than the number who added sugar.
  • Tea: Earl Grey was the most preferred tea while green tea was the least.
  • where: The majority of tea consumers bought their tea from a chain store, with the fewest buying from a tea shop.

4.4 Multiple Correspondence Analysis

#* Multiple Correspondence Analysis
mca<- MCA(tea_time, graph = FALSE)
#summary
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
  • The eigenvalues show the variance and the percentage of variance retained by each dimension. Dimension 1 has the highest share at 15.238% while Dimension 11 has the least at 3.385%.

  • Individuals: shows the individuals' coordinates, their contribution (%) to each dimension, and the cos2 (squared correlation) on the dimensions, for the first 10 individuals.

  • Categories: shows the coordinates of the variable categories, their contribution (%), the cos2 (squared correlations) and the v.test value. The v.test follows a normal distribution: if the value is below/above \(\pm\) 1.96, the coordinate is significantly different from zero.

  • Categorical variables (eta2): shows the squared correlation between each variable and the dimensions. A value close to one indicates a strong link between the variable and the dimension.
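These same pieces can also be pulled directly out of the FactoMineR result object (a sketch using the standard result slots of MCA()):

```r
# Access the summary components of the MCA result directly
head(mca$eig, 3)                     # eigenvalues / % of variance per dimension
round(mca$var$eta2, 3)               # squared correlations (eta2) per variable
round(mca$var$v.test[1:4, 1:3], 2)   # v.test values for the first categories
```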

#print the output of the MCA() function
print(mca)
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 6 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"

4.5 Visualization and interpretation of MCA

4.5.1 Eigenvalues / variance

#Visualize the proportion of variance retained by different dimensions 
eig.val<- get_eigenvalue(mca)
head(eig.val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  0.2793712        15.238428                    15.23843
## Dim.2  0.2609265        14.232352                    29.47078
## Dim.3  0.2193358        11.963768                    41.43455
## Dim.4  0.1894379        10.332978                    51.76753
## Dim.5  0.1772231         9.666715                    61.43424
## Dim.6  0.1561774         8.518770                    69.95301

Dimension 1 had the highest eigenvalue (0.279) and variance percentage at 15.23%. The variance percentage decreased as the dimension increased, while the cumulative variance percentage increased.

4.5.2 Visualize the percentage of inertia explained by each MCA dimension

fviz_screeplot(mca, addlabels = TRUE, ylim=c(0,45))

The plot displays how the percentage of variance is distributed across the dimensions. The variance percentage declines as the dimension number increases.

4.5.3 Visualize the biplot of individuals and variable categories

fviz_mca_biplot(mca,
                repel = TRUE, 
                ggtheme = theme_minimal())
## Warning: ggrepel: 244 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The plot shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.

The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Categories such as tea bag, sugar, chain store and alone are similar, based on the short distances between them, whereas categories such as green, tea shop and other lie further apart.

4.5.4 Visualization of the correlation between variables and principal dimensions

fviz_mca_var(mca, choice = "mca.cor",
             repel = TRUE, 
             ggtheme = theme_minimal())

The plot helps to identify the variables that are most correlated with each dimension.
From the plot (consistent with the eta2 values above):

  • The variables how and where are the most correlated with dimension 1.
  • The variable where, followed by how, is the most correlated with dimension 2.

4.5.5 Coordinates of variable categories

Visualizes which variable categories belong to similar groups and whether they are positively or negatively correlated.

#display the coordinates of each variable category in each dimension:
#var <- get_mca_var(mca); head(round(var$coord, 2), 4)
#visualize only variables
fviz_mca_var(mca, col.var="black", shape.var = 15,
             repel = TRUE)

From the plot:

  • Variable categories with a similar profile are grouped together, by quadrant:
    1. tea bag, chain store, alone and sugar belong to group 1.
    2. Not.lunch, green, unpackaged and tea shop belong to group 2.
    3. chain store+tea shop, tea bag+unpackaged, lunch, No.sugar, lemon, other and black belong to group 3.
    4. milk and Earl Grey belong to the fourth group.
  • Negatively correlated variable categories are positioned on opposite sides of the plot origin. For example, groups 1 and 2 are negatively correlated.
  • The distance between a category point and the origin measures the quality of the variable category on the factor map. Category points far from the origin, such as unpackaged, tea shop, other and chain store+tea shop, are well represented on the factor map, compared with categories closer to the origin such as Not.lunch, No.sugar, sugar and alone.

4.5.6 MCA factor plot map

#visualize MCA
plot(mca, invisible=c("ind"), habillage="quali", graph.type="classic")

The MCA factor map shows the distribution of the variable categories across the two dimensions and how far they are from the origin. Categories close to the origin have little influence and contribute little to the variability of the dimensions.

4.5.7 Contribution of variable categories to the dimensions

The variable categories with larger values contribute the most to the definition of the dimensions. These are the most important categories for explaining the variability in the data set.

4.5.7.1 Contribution of variable categories to dimension 1

fviz_contrib(mca, choice= "var", axes=1, top= 20) # top 20 variable categories

In dimension 1, tea shop, unpackaged, tea bag and chain store are the most important variable categories in explaining the variability, with tea shop the most important, contributing approximately 25% of the variability of dimension 1.

4.5.7.2 Contribution of variable categories to dimension 2

fviz_contrib(mca, choice= "var", axes = 2, top = 20)

In dimension 2, the category chain store+tea shop is the most important, contributing almost 30% of the variability of dimension 2. Other important categories include:

  • tea bag+ unpackaged
  • tea shop
  • unpackaged
  • other
  • green
date()
## [1] "Sun Dec 12 18:15:29 2021"

Longitudinal Data Analysis

Wide-format data has a column for each variable, whereas long-format data has one column identifying the variable type and one column for the values of those variables.
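A tiny self-contained illustration of the wide-to-long conversion with gather() (the toy data frame is made up for the demonstration):

```r
# Wide form: one column per weighing day (WD1, WD8)
wide <- data.frame(ID = 1:2, WD1 = c(240, 225), WD8 = c(250, 230))

# Long form: one WD column naming the day, one Weight column with the values
long <- tidyr::gather(wide, key = WD, value = Weight, -ID)
long
```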

# libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(patchwork)
## 
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
## 
##     area
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
library(ggpubr)

Chapter 8: Analysis of Longitudinal Data I: Graphical Displays and the Summary Measure Approach

Data from a nutrition study conducted in three groups of rats will be used. The groups were put on different diets, and each animal's body weight (grams) was recorded repeatedly, approximately weekly (except in week seven, when two recordings were taken), over a 9-week period.

Load the data and check the variables and structure

#set working directory
setwd("C:/LocalData/ocholla/IODS-project/Data")

RATS<- read.table("Rats.txt", header = TRUE, sep = '\t')

#convert variables ID and Group to factors
RATS<- within(RATS, {
  ID<- factor(ID)
  Group<- factor(Group)
})

#Convert data from wide to long form
RATSL<- RATS%>% 
  gather(key = WD, value = Weight, -ID, -Group) %>% 
  #mutate a new variable Time by extracting the number of the day from WD
  mutate(Time = as.integer(substr(WD,3,4)))
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,~
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, ~
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", ~
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470~
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, ~
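The long form can be sanity-checked directly: one row per rat per measurement occasion, so 16 rats times 11 weighing days gives the 176 rows seen above.

```r
# Sanity check: 16 rats, 11 measurement occasions, 16 * 11 = 176 rows
length(unique(RATSL$ID))   # number of rats
length(unique(RATSL$WD))   # number of weighing days
nrow(RATSL)                # total rows in the long data
```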

Plot the Weight values of all 16 rats, differentiating between the nutrition groups into which the rats have been randomized

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID))+
  geom_line()+
  scale_linetype_manual(values = rep(1:10,
                                     times=4))+
  facet_grid(.~ Group, labeller = label_both)+
  theme(legend.position = "none")+
  scale_y_continuous(limits = c(min(RATSL$Weight),
                                max(RATSL$Weight)))

From the graph, the rats' weights varied within the groups across the 9 weeks. The rats with higher weights at the beginning tend to have higher weights throughout the study.

  • In group 1, the rats had the lowest weights across the period compared to groups 2 and 3, and their weights stayed relatively constant.
  • In group 2, the variation in weights was the largest across the period compared to the other groups.

Standardize the values of each observation (tracking phenomenon)

How the weights of individual rats track from the beginning to the end of the study is displayed more clearly after standardizing the weights at each time point. Plot the standardized observations against the study period.

#standardize the variables weights
RATSL<- RATSL %>%
  group_by(Time) %>% 
  mutate(stdWeight = (Weight - mean(Weight))/ sd(Weight) ) %>% 
  ungroup()

#plot again with the standardized weights

ggplot(RATSL, aes(x = Time, y = stdWeight,
                  linetype = ID))+
  geom_line()+
  scale_linetype_manual(values = rep(1:10,
                                     times = 4))+
  facet_grid(. ~Group, labeller = label_both)+ scale_y_continuous(name = "standardized Weights")

From the plot:

  • The standardized means of groups 2 and 3 lie around 1, suggesting a small difference between these two groups.
  • The standardized mean of group 1 lies around -1; it diverges from the rest, indicating a difference between group 1 and the other two groups.

With many observations, graphical displays of individual observations are often obscured and hence of little use. This creates a need for graphs showing the average profile of each nutrition group along with some indication of the variation of the observations at each time point.

#number of subjects (rats), baseline (day 1) included
n<- 16

#summary data with mean and standard error of RATS by nutrition group and Time
RATSS<- RATSL %>%
  group_by(Group, Time) %>% 
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
#plot the mean profiles
ggplot(RATSS, aes(x= Time, y= mean, linetype = Group, shape = Group))+
  geom_line()+
  scale_linetype_manual(values = c(1,2,3))+
  geom_point(size =3)+ #change from 3 to 2
  scale_shape_manual(values = c(1,2,3))+
  geom_errorbar(aes(ymin= mean -se, ymax= mean + se,
                    linetype ="1"), width= 0.3)+
  theme(legend.position = "right")+
  scale_y_continuous(name = "mean(Weight)+/- se(Weight)")

There is no overlap in the mean profiles of the three nutrition groups, indicating that the groups differ with respect to mean rat weight.

A possible alternative to the mean profile plot is a side-by-side box plot of the weights at each measurement day.

ggplot(RATSL, aes(x= factor(Time), y = Weight, 
                  fill = Group))+
  geom_boxplot(position = position_dodge(width = 0.9))+
  theme_bw()+theme(panel.grid.major = element_blank(), 
                   panel.grid.minor = element_blank())+
  theme(legend.position = "bottom")+ #c(0.8,0.8)
  scale_x_discrete(name = "Time")

The plot suggests an absence of outliers in the weights; it also shows again the general increase in rat weight across the study period in all three groups.

Applying the summary measure approach

The mean weight of days 8-64 will be the chosen summary measure. This measure is calculated and then displayed using box plots for each nutrition group.

RATSL8S<- RATSL %>%
  filter(Time > 1 ) %>% # day 1 is the baseline
  group_by(Group, WD) %>%
  summarise( mean = mean(Weight) )%>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
#draw  a box plot of the mean versus Group
ggplot(RATSL8S, aes(x = Group, y =mean)) +
  geom_boxplot()+
  stat_summary(fun.y = "mean", geom = "point", shape = 23, size =4, 
               fill = "blue")+
  scale_y_continuous(name = "mean(Weight), Time 8-64")
## Warning: `fun.y` is deprecated. Use `fun` instead.

From the boxplot:

  • The mean summary measure clearly differs between the three diet groups, and the distributions do not appear skewed.
  • There are no outliers in any of the three diet groups.

With no outliers, the subsequent tasks on removing outliers are not applicable.
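Had an outlier been present, it could have been dropped with a simple filter. A minimal, self-contained sketch using a toy data frame in place of RATSL8S (the column names follow the exercise, but the values and the cutoff of 550 are hypothetical):

```r
# toy stand-in for the per-group summary data RATSL8S
d <- data.frame(Group = factor(c(1, 1, 2, 2, 3, 3)),
                mean  = c(260, 265, 460, 590, 520, 525))

# drop any mean weight above the assumed cutoff
d1 <- subset(d, mean < 550)
nrow(d1)  # 5 rows remain
```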

Apply ANOVA to assess differences between the nutrition groups

#anova_test() comes from the rstatix package
res.aov<- RATSL8S %>% anova_test(mean ~ Group)
## Coefficient covariances computed by hccm()
res.aov
## ANOVA Table (type II tests)
## 
##   Effect DFn DFd        F       p p<.05   ges
## 1  Group   2  27 1091.048 1.5e-26     * 0.988

From the results, there is a significant difference between the three diet groups as the p-value < 0.05 (p = 1.5e-26), which is highlighted by “*”: F(2, 27) = 1091, eta2[g] = 0.988.

Using baseline data

Using the baseline measurement appropriately as a covariate in an analysis of covariance helps to gain precision.
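A minimal, self-contained sketch of this idea with simulated data (every name and number here is illustrative; in the rats exercise the day-1 weight WD1 would play the role of the baseline):

```r
set.seed(1)
n <- 16
group    <- factor(rep(1:2, each = n / 2))
baseline <- rnorm(n, mean = 300, sd = 30)                       # e.g. weight at day 1
outcome  <- baseline + ifelse(group == 2, 20, 0) + rnorm(n, sd = 5)

# analysis of covariance: the baseline enters as a covariate,
# absorbing subject-level variation before the group effect is tested
fit <- lm(outcome ~ baseline + group)
anova(fit)
```

Conditioning on the baseline removes much of the between-subject variability, so the group comparison is made with a smaller error variance.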

Chapter 9: Analysis of Longitudinal Data II: Linear Mixed Effects Models for Normal Response Variables

Longitudinal data, where a response variable is measured on each subject on several occasions, pose a challenge: the repeated measurements are likely to be correlated rather than independent.

To test how linear mixed effects models are applied, data taken from Davis (2002) will be used. It comprises 40 male subjects who were randomly assigned to one of two treatment groups; each subject was rated on the Brief Psychiatric Rating Scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia. The question of most interest is whether the BPRS trajectories differ between the two treatments.

9.3.1 Fitting the independence Model to the BPRS Data

Load the data and, ignoring the repeated-measures structure of the data, assume that all the observations are independent of one another.

#set working directory
setwd("C:/LocalData/ocholla/IODS-project/Data")
#Load the data sets (BPRS)
BPRS<- read.table("BPRS.txt", sep= " ", header = TRUE)
#convert subject and treatment from integer to factors
BPRS<- within(BPRS, {
  treatment<- factor(treatment)
  subject<- factor(subject)
})

#Add an ID column numbering the men
BPRS$ID<- seq.int(nrow(BPRS))
glimpse(BPRS)
## Rows: 40
## Columns: 12
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ week0     <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, ~
## $ week1     <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68, ~
## $ week2     <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65, ~
## $ week3     <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49, ~
## $ week4     <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36, ~
## $ week5     <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32, ~
## $ week6     <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27, ~
## $ week7     <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30, ~
## $ week8     <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37, ~
## $ ID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~

The data is in wide format; it needs to be converted to long format for the analysis.

  1. Convert the data to long form
BPRSL<- gather(BPRS, key = weeks, value = bprs, week0:week8) %>%
  #extract the week number- adding the variable week into the dataset
  mutate(week = as.integer(substr(weeks, 5, 5)))
#take a glimpse at the BPRSL data
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ ID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0~
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, ~
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

When we ignore that each set of 9 bprs measurements comes from the same subject, we have 360 observations of bprs, week and treatment that can easily be analysed using multiple linear regression.

Plot the data, identifying the bprs observations of each treatment group but ignoring the longitudinal nature of the data.

ggplot(BPRSL, aes(x = week, y = bprs, group = ID))+
  geom_text(aes(label = treatment))+
  #geom_line(aes(linetype = subject))+
  scale_x_continuous(name = "Week (number)", breaks = seq(0,8,1))+
  scale_y_continuous(name = "bprs") + 
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = 
          element_blank())

In general, BPRS decreases in both treatment 1 &2 as the number of weeks increased. Men who have higher BPRS values at the beginning tend to have higher values throughout the study.

Fit a Linear regression model

Fit a multiple linear regression model with bprs as the response variable and week and treatment as explanatory variables, ignoring the repeated-measures structure of the data.

#create a regression model BPRS_reg
BPRS_reg<- lm(bprs ~ week + treatment, data = BPRSL)
#summary 
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

From the model, the week variable is highly significant (p < 0.05) and negatively associated with the BPRS measure, implying that the BPRS decreased as the weeks passed. Treatment 2 is not significant when treatment 1 is held as the reference. The two explanatory variables explain approximately 18% of the variability in BPRS. This multiple linear regression model assumes independence of the repeated BPRS measures, and this assumption is highly unlikely to hold.

9.3.2 Fitting Linear Mixed Models to the BPRS data

A graphical display of the bprs data that takes the longitudinal structure into account by joining together the bprs measures belonging to each man.

ggplot(BPRSL, aes(x= week, y = bprs, group= ID))+ 
  geom_line(aes(linetype= treatment))+
  scale_x_continuous(name= "Weeks (number)", breaks = seq(0,8,1))+
  scale_y_continuous(name = "bprs")+
  theme_bw()+ theme(legend.position = "top")+
  theme(panel.grid.major = element_blank(), panel.grid.minor = 
          element_blank())

From the plot, there is a general decline in bprs across all men as the number of weeks increased across both treatments.

A scatterplot matrix of repeated measures of BPRS to assess the independence of the repeated measures

pairs(BPRS[, 3:11], cex = 0.7)

From the scatter plot, the repeated measures of bprs are not independent of one another; they are correlated with one another.
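This within-subject correlation can be illustrated with a small self-contained simulation (all numbers here are illustrative, not estimates from the BPRS data): a shared subject-level intercept induces positive correlation between the repeated measures of a subject.

```r
set.seed(42)
nsubj <- 40; nocc <- 9
subj_effect <- rnorm(nsubj, mean = 45, sd = 10)   # subject-level intercept

# one column per occasion: common subject effect + trend + noise
y <- sapply(1:nocc, function(t) subj_effect - 2 * t + rnorm(nsubj, sd = 5))

cors <- cor(y)
# expected between-occasion correlation: 100 / (100 + 25) = 0.8
mean(cors[upper.tri(cors)])
```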

Fitting a random intercept model, with week and treatment as explanatory variables

First fit the random intercept model with the two explanatory variables week and treatment. This kind of model allows the linear regression fit of each man to differ in intercept from those of the other men.

#create a random intercept model
BPRS_ref<- lmer(bprs ~ week + treatment + (1 | ID), data =  BPRSL, REML = FALSE)

#print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | ID)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  ID, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000

The estimated variance of the subject random effects (97.39) shows clear subject-to-subject variation in the intercepts of the individual regression fits. The estimated standard error of week is smaller in the random intercept model than in the multiple linear regression model: ignoring the within-subject dependence in the earlier model led to a less precise estimate of the week effect.

Fit the random intercept and random slope model, with week and treatment as explanatory variables

In addition to the random intercept, this model allows the linear regression fits for each individual to differ in slope.

BPRS_ref1<- lmer(bprs ~ week + treatment + (week|ID), data = BPRSL, REML = FALSE)
#print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | ID)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  ID       (Intercept) 167.827  12.955        
##           week          2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  ID, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## week         -2.2704     0.2713  -8.370
## treatment2    1.5139     3.1392   0.482
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.545       
## treatment2 -0.593  0.000
#perform an ANOVA test on the two models
anova(BPRS_ref1,BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | ID)
## BPRS_ref1: bprs ~ week + treatment + (week | ID)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRS_ref     5 2582.9 2602.3 -1286.5   2572.9                         
## BPRS_ref1    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model gives lower AIC and BIC values than the earlier random intercept model, and the likelihood ratio test yields a small, statistically significant p-value. These results indicate that the random intercept and slope model provides a better fit for the data.

Fitting a random intercept and slope model that allows for a treatment x week interaction

BPRS_ref2<- lmer(bprs ~ week * treatment + (week|ID),
                 data = BPRSL, REML = FALSE)

#print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | ID)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.5   2554.5  -1253.7   2507.5      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4747 -0.5256 -0.0866  0.4435  3.7884 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  ID       (Intercept) 164.204  12.814        
##           week          2.203   1.484   -0.66
##  Residual              36.748   6.062        
## Number of obs: 360, groups:  ID, 40
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.9840  16.047
## week             -2.6283     0.3752  -7.006
## treatment2       -2.2911     4.2200  -0.543
## week:treatment2   0.7158     0.5306   1.349
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.668              
## treatment2  -0.707  0.473       
## wek:trtmnt2  0.473 -0.707 -0.668
#perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | ID)
## BPRS_ref2: bprs ~ week * treatment + (week | ID)
##           npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1    7 2523.2 2550.4 -1254.6   2509.2                    
## BPRS_ref2    8 2523.5 2554.6 -1253.7   2507.5  1.78  1     0.1821

The likelihood ratio test of the interaction model against the corresponding model without interaction gives a chi-squared statistic of 1.78 on 1 degree of freedom, with a large p-value (0.18) that is not statistically significant. From these results, the interaction model does not provide a better fit for the men's treatment data.

Find the fitted values from the interaction model and plot the fitted bprs profiles for each man alongside the observed values

#create a vector of the fitted values
Fitted<- fitted(BPRS_ref2)
#Create a new column fitted to BPRSL
BPRSL<- BPRSL%>% 
  mutate(Fitted)

#plot for the observed values
p1<- ggplot(BPRSL, aes(x = week, y = bprs, group = ID))+  
  geom_line(aes(linetype = treatment))+ 
  scale_x_continuous(name = "Week (number)", breaks = seq(0, 8,1))+
  scale_y_continuous(name = "Observed bprs ")+
  theme_bw()+ theme(legend.position = "right")+
  theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())+
  ggtitle("observed")

#plot for the fitted values
p2<-ggplot(BPRSL, aes(x = week, y= Fitted, group = ID))+
             geom_line(aes(linetype = treatment))+
             scale_x_continuous(name = "Week (number)", breaks = seq(0,8,1))+
             scale_y_continuous(name = "Fitted bprs ")+
             theme_bw()+theme(legend.position = "right")+
             theme(panel.grid.major = element_blank(),
                   panel.grid.minor = element_blank())+
             ggtitle("Fitted")

#plot the observed and fitted
p1 +p2

The plot shows how well the interaction model predicts the observed data. The fitted plot incorporates the random effects of the model.

date()
## [1] "Sun Dec 12 18:15:35 2021"

The END